55 research outputs found

    Optimizing model-agnostic Random Subspace ensembles

    This paper presents a model-agnostic ensemble approach for supervised learning. The proposed approach is based on a parametric version of Random Subspace, in which each base model is learned from a feature subset sampled according to a Bernoulli distribution. Parameter optimization is performed using gradient descent and is rendered tractable by using an importance sampling approach that circumvents frequent re-training of the base models after each gradient descent step. The degree of randomization in our parametric Random Subspace is thus automatically tuned through the optimization of the feature selection probabilities. This is an advantage over the standard Random Subspace approach, where the degree of randomization is controlled by a hyper-parameter. Furthermore, the optimized feature selection probabilities can be interpreted as feature importance scores. Our algorithm can also easily incorporate any differentiable regularization term to impose constraints on these importance scores
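
The sampling and reweighting ideas in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the example probabilities are hypothetical, and the weight formula is the standard importance-sampling ratio for independent Bernoulli draws, assumed here to be the kind of reweighting the abstract refers to.

```python
import random

def sample_mask(alpha, rng):
    """Draw one feature subset: feature j is kept with probability alpha[j]."""
    return [1 if rng.random() < a else 0 for a in alpha]

def mask_probability(mask, alpha):
    """Probability of a mask under independent Bernoulli(alpha[j]) draws."""
    p = 1.0
    for z, a in zip(mask, alpha):
        p *= a if z else (1.0 - a)
    return p

def importance_weights(masks, alpha_old, alpha_new):
    """Weights that let masks drawn under alpha_old be re-used to estimate
    an expectation under alpha_new (requires 0 < alpha_old[j] < 1)."""
    return [mask_probability(m, alpha_new) / mask_probability(m, alpha_old)
            for m in masks]

# Hypothetical example: four features, five base models.
rng = random.Random(0)
alpha_old = [0.5, 0.5, 0.5, 0.5]
alpha_new = [0.8, 0.5, 0.5, 0.2]
masks = [sample_mask(alpha_old, rng) for _ in range(5)]
weights = importance_weights(masks, alpha_old, alpha_new)
```

Under this assumption, base models trained on `masks` would not need re-training after an update of the selection probabilities; their contributions would only be reweighted by `weights`.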

    Context-dependent feature analysis with random forests

    Feature selection is often more complicated than identifying a single subset of input variables that together explain the output. There may be interactions that depend on contextual information, i.e., variables that turn out to be relevant only in some specific circumstances. In this setting, the contribution of this paper is to extend the random forest variable importance framework in order (i) to identify variables whose relevance is context-dependent and (ii) to characterize as precisely as possible the effect of contextual information on these variables. The usage and the relevance of our framework for highlighting context-dependent variables are illustrated on both artificial and real datasets. Comment: Accepted for presentation at UAI 201

    Statistical interpretation of machine learning-based feature importance scores for biomarker discovery

    Motivation: Univariate statistical tests are widely used for biomarker discovery in bioinformatics. These procedures are simple and fast, and their output is easily interpretable by biologists, but they can only identify variables that provide a significant amount of information in isolation from the other variables. As biological processes are expected to involve complex interactions between variables, univariate methods thus potentially miss some informative biomarkers. Variable relevance scores provided by machine learning techniques, however, are potentially able to highlight multivariate interacting effects, but unlike the p-values returned by univariate tests, these relevance scores are usually not statistically interpretable. This lack of interpretability hampers the determination of a relevance threshold for extracting a feature subset from the rankings and also prevents the wide adoption of these methods by practitioners. Results: We evaluated several procedures, both existing and novel, that extract relevant features from rankings derived from machine learning approaches. These procedures replace the relevance scores with measures that can be interpreted in a statistical way, such as p-values, false discovery rates, or family-wise error rates, for which it is easier to determine a significance level. Experiments were performed on several artificial problems as well as on real microarray datasets. Although the methods differ in terms of computing times and the tradeoff they achieve between false positives and false negatives, some of them greatly help in the extraction of truly relevant biomarkers and should thus be of great practical interest for biologists and physicians. As a side conclusion, our experiments also clearly highlight that using model performance as a criterion for feature selection is often counter-productive
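
One generic way to turn a relevance score into a p-value is a permutation test, sketched below. This is an illustrative assumption and not necessarily one of the procedures evaluated in the paper: the function names are hypothetical, and an absolute covariance stands in for a machine-learning relevance score.

```python
import random

def abs_covariance(x, y):
    """Stand-in relevance score: absolute sample covariance of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))) / n

def permutation_p_value(x, y, score=abs_covariance, n_perm=999, seed=0):
    """Empirical p-value of score(x, y) under random shuffles of y."""
    rng = random.Random(seed)
    observed = score(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any link between x and y
        if score(x, y_perm) >= observed:
            hits += 1
    # The +1 terms count the observed ordering as one permutation.
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: one informative feature and one constant feature.
output = [2 * v + 1 for v in range(12)]
p_informative = permutation_p_value(list(range(12)), output)
p_constant = permutation_p_value([3] * 12, output)
```

A threshold on such p-values is easier to justify than a threshold on raw scores, which is the point the abstract makes about statistical interpretability.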

    Inferring Regulatory Networks from Expression Data Using Tree-Based Methods

    One of the pressing open problems of computational systems biology is the elucidation of the topology of genetic regulatory networks (GRNs) using high-throughput genomic data, in particular microarray gene expression data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge aims to evaluate the success of GRN inference algorithms on benchmarks of simulated data. In this article, we present GENIE3, a new algorithm for the inference of GRNs that was the best performer in the DREAM4 In Silico Multifactorial challenge. GENIE3 decomposes the prediction of a regulatory network between p genes into p different regression problems. In each of the regression problems, the expression pattern of one of the genes (target gene) is predicted from the expression patterns of all the other genes (input genes), using the tree-based ensemble methods Random Forests or Extra-Trees. The importance of an input gene in the prediction of the target gene expression pattern is taken as an indication of a putative regulatory link. Putative regulatory links are then aggregated over all genes to provide a ranking of interactions from which the whole network is reconstructed. In addition to performing well on the DREAM4 In Silico Multifactorial challenge simulated data, we show that GENIE3 compares favorably with existing algorithms to decipher the genetic regulatory network of Escherichia coli. It does not make any assumption about the nature of gene regulation, can deal with combinatorial and non-linear interactions, produces directed GRNs, and is fast and scalable. In conclusion, we propose a new algorithm for GRN inference that performs well on both synthetic and real gene expression data. The algorithm, based on feature selection with tree-based ensemble methods, is simple and generic, making it adaptable to other types of genomic data and interactions
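
The decompose-and-aggregate structure described in this abstract can be sketched as below. This is a simplified illustration, not the GENIE3 implementation: the function names and the toy data are hypothetical, and a squared Pearson correlation stands in for the Random Forests or Extra-Trees importances. Unlike tree-based importances, this stand-in is symmetric, so it does not reproduce the directed nature of the links.

```python
def squared_correlation_scores(inputs, target):
    """Stand-in importance: squared Pearson correlation of each input gene
    with the target gene (GENIE3 itself uses tree-ensemble importances)."""
    n = len(target)
    mean_y = sum(target) / n
    syy = sum((v - mean_y) ** 2 for v in target)
    scores = {}
    for name, x in inputs.items():
        mean_x = sum(x) / n
        sxx = sum((v - mean_x) ** 2 for v in x)
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, target))
        scores[name] = 0.0 if sxx == 0 or syy == 0 else (sxy * sxy) / (sxx * syy)
    return scores

def rank_links(expression, importance=squared_correlation_scores):
    """Solve one problem per target gene, then pool all putative links.

    expression maps each gene name to its expression values across samples.
    Returns (regulator, target, score) tuples, highest score first.
    """
    links = []
    for target_gene, target_values in expression.items():
        # Every other gene is an input for this target's problem.
        inputs = {g: v for g, v in expression.items() if g != target_gene}
        for regulator, score in importance(inputs, target_values).items():
            links.append((regulator, target_gene, score))
    links.sort(key=lambda link: link[2], reverse=True)
    return links

# Hypothetical toy data: g2 closely follows g1, g3 is unrelated.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 4.0, 6.0, 8.0, 11.0],
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
links = rank_links(expression)
```

Passing a different `importance` function, for example one backed by a tree ensemble, leaves the per-target loop and the final ranking unchanged.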

    From global to local MDI variable importances for random forests and when they are Shapley values

    Peer reviewed. Random forests have been widely used for their ability to provide so-called importance measures, which give insight at a global (per dataset) level on the relevance of input variables to predict a certain output. On the other hand, methods based on Shapley values have been introduced to refine the analysis of feature relevance in tree-based models to a local (per instance) level. In this context, we first show that the global Mean Decrease of Impurity (MDI) variable importance scores correspond to Shapley values under some conditions. Then, we derive a local MDI importance measure of variable relevance, which has a very natural connection with the global MDI measure and can be related to a new notion of local feature relevance. We further link local MDI importances with Shapley values and discuss them in the light of related measures from the literature. The measures are illustrated through experiments on several classification and regression problems

    Combining tree-based and dynamical systems for the inference of gene regulatory networks

    Motivation: Reconstructing the topology of gene regulatory networks (GRNs) from time series of gene expression data remains an important open problem in computational systems biology. Existing GRN inference algorithms face one of two limitations: model-free methods are scalable but suffer from a lack of interpretability and cannot in general be used for out-of-sample predictions. On the other hand, model-based methods focus on identifying a dynamical model of the system. These are clearly interpretable and can be used for predictions; however, they rely on strong assumptions and are typically very demanding computationally. Results: Here, we propose a new hybrid approach for GRN inference, called Jump3, exploiting time series of expression data. Jump3 is based on a formal on/off model of gene expression but uses a non-parametric procedure based on decision trees (called "jump trees") to reconstruct the GRN topology, allowing the inference of networks of hundreds of genes. We show the good performance of Jump3 on in silico and synthetic networks and applied the approach to identify regulatory interactions activated in the presence of interferon gamma. Availability and implementation: Our MATLAB implementation of Jump3 is available at http://homepages.inf.ed.ac.uk/vhuynht/software.html

    Distinct blood protein profiles associated with the risk of short-term and mid/long-term clinical relapse in patients with Crohn's disease stopping infliximab: when the remission state hides different types of residual disease activity.

    Peer reviewed. OBJECTIVE: Despite being in sustained and stable remission, patients with Crohn's disease (CD) stopping anti-tumour necrosis factor α (TNFα) show a high rate of relapse (~50% within 2 years). Characterising non-invasively the biological profiles of those patients is needed to better guide the decision of anti-TNFα withdrawal. DESIGN: Ninety-two immune-related proteins were measured by proximity extension assay in serum of patients with CD (n=102) in sustained steroid-free remission and stopping anti-TNFα (infliximab). As previously shown, a stratification based on time to clinical relapse was used to characterise the distinct biological profiles of relapsers (short-term relapsers: 6 months). Associations between protein levels and time to clinical relapse were determined by univariable Cox model. RESULTS: The risk (HR) of mid/long-term clinical relapse was specifically associated with a high serum level of proteins mainly expressed in lymphocytes (LAG3, SH2B3, SIT1; HR: 2.2-4.5; p<0.05), a low serum level of anti-inflammatory effectors (IL-10, HSD11B1; HR: 0.2-0.3; p<0.05) and cellular junction proteins (CDSN, CNTNAP2, CXADR, ITGA11; HR: 0.4; p<0.05). The risk of short-term clinical relapse was specifically associated with a high serum level of pro-inflammatory effectors (IL-6, IL12RB1; HR: 3.5-3.6; p<0.05) and a low or high serum level of proteins mainly expressed in antigen presenting cells (CLEC4A, CLEC4C, CLEC7A, LAMP3; HR: 0.4-4.1; p<0.05). CONCLUSION: We identified distinct blood protein profiles associated with the risk of short-term and mid/long-term clinical relapse in patients with CD stopping infliximab. These findings constitute an advance for the development of non-invasive biomarkers guiding the decision of anti-TNFα withdrawal

    dynGENIE3: dynamical GENIE3 for the inference of gene networks from time series expression data

    The elucidation of gene regulatory networks is one of the major challenges of systems biology. Measurements about genes that are exploited by network inference methods are typically available either in the form of steady-state expression vectors or time series expression data. In our previous work, we proposed the GENIE3 method that exploits variable importance scores derived from Random forests to identify the regulators of each target gene. This method provided state-of-the-art performance on several benchmark datasets, but it could not be applied specifically to time series expression data. We propose here an adaptation of the GENIE3 method, called dynamical GENIE3 (dynGENIE3), for handling both time series and steady-state expression data. The proposed method is evaluated extensively on the artificial DREAM4 benchmarks and on three real time series expression datasets. Although dynGENIE3 does not systematically yield the best performance on each and every network, it is competitive with diverse methods from the literature, while preserving the main advantages of GENIE3 in terms of scalability